Systems and methods for determining traits based on voice analysis

ABSTRACT

Systems and methods are provided herein for determining one or more traits of a speaker based on voice analysis in order to present a content item to the speaker. In one example, the method receives a voice query and determines whether the voice query matches, within a first confidence threshold, a speaker identification (ID) among a plurality of speaker IDs stored in a speaker profile. In response to determining that the voice query matches the speaker ID within the first confidence threshold, the method bypasses a trait prediction engine and retrieves a trait among a plurality of traits stored in the speaker profile and associated with the matched speaker ID. The method further provides a content item based on the retrieved trait.

BACKGROUND

The present disclosure is directed to systems and methods for determining one or more traits of a speaker based on voice analysis in order to present content to the speaker. In particular, systems and methods are provided for either bypassing trait prediction or invoking trait prediction based on a confidence level in the identity of the speaker determined from the voice analysis.

SUMMARY

Voice analysis applications use biometric fingerprints to uniquely identify a voice for natural language understanding (NLU) to address use cases such as authenticating a speaker or providing content personalized to the speaker. Such applications analyze audio signals of a speaker to invoke a set of application programming interfaces, each of which performs a prediction, such as the identification of the speaker or traits such as the age and gender of the speaker. Conventionally, these predictions are performed with an ensemble-model approach, which may result in inaccurate predictions. Such an ensemble-model approach utilizes a supervised learning technique, creating multiple sub-system models that predict the identification and traits of a speaker by using different training data sets. Typically, audio features of a speaker are analyzed either to predict traits and identification of the speaker simultaneously or to predict traits before the identification. Thus, sub-system modules are invoked every time during training and during live trait predictions of the speaker, both of which are expensive and time-consuming operations.

To solve these problems, systems and methods are provided herein for bypassing one or more sub-systems used to predict traits of the speaker. To accomplish this, the system relies on the identity of the speaker being predicted with a certain confidence level. Upon this identity prediction, one or more traits of the identified speaker are retrieved from an already determined trait profile. Thus, a trait prediction sub-system is invoked only when the identity of the speaker is not predicted with the certain confidence level. Additionally, the identity of the speaker is determined without utilizing trait(s) at the time of training and prediction.

In some embodiments, a voice query is received from a speaker. If the system determines that the speaker is identified with a certain confidence level, then a trait associated with the identified speaker is retrieved from a profile and sent to the NLU. However, if the speaker is not identified with the certain confidence level, then a trait prediction engine is invoked to determine a trait of the speaker while dynamically creating a new ID for the speaker. For example, a profile is already created for a certain number of members living in a household, such that if the speaker is one of those members, the member's trait can be retrieved from the profile. However, if the speaker is a guest visiting the household and a profile does not exist for the guest, then the trait prediction engine is invoked to determine the trait of the guest and create a new ID for the guest.

In some embodiments, ID prediction and trait engine(s) are trained to predict the ID and one or more traits of the speaker. Such training includes processing features of various types of audio signals of the same speaker to correlate both with the ID and with the one or more traits corresponding to the ID. For example, speech features (pitch, frequency, etc.) are utilized to train the ID prediction engines to identify the speaker and to train the trait engine to predict the trait of the speaker. In one embodiment, the trait engine(s) are updated/trained based on the confidence level of the ID prediction engine. For example, if the ID prediction engine predicts the ID of a speaker with a higher confidence level than before, then the trait engine(s) are updated using the same audio of the speaker.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects and advantages of the present disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:

FIG. 1 shows an illustrative example of determining trait(s) based on voice analysis to provide a content item, in accordance with some embodiments of the disclosure;

FIG. 2 shows a block diagram of an illustrative example of a system for determining trait(s) based on voice analysis, in accordance with some embodiments of the disclosure;

FIG. 3 shows a block diagram of an illustrative system, in accordance with some embodiments of the disclosure;

FIG. 4 depicts a block diagram of an illustrative system, in accordance with some embodiments of the disclosure;

FIG. 5 depicts a flowchart of illustrative steps for determining trait(s) based on voice analysis and providing a content item based on the determined trait(s), in accordance with some embodiments of the disclosure; and

FIG. 6 depicts a flowchart of illustrative steps for updating trait(s) based on voice analysis, in accordance with some embodiments of the disclosure.

DETAILED DESCRIPTION

Methods and systems are described herein for determining trait(s) of a speaker based on voice analysis and providing a content item based on the determined trait(s). In some embodiments, a trait determination application determines whether a voice query matches, within a confidence threshold, a speaker identification (ID) among a plurality of speaker IDs stored in a speaker profile. The speaker profile comprises data including a plurality of unique speaker IDs, a biometric fingerprint (hash) corresponding to each of the speaker IDs, and one or more traits corresponding to each of the speaker IDs. In one example, the speaker profile provides data of members of a household. In one embodiment, the method generates a hash from the voice query and compares it with the hash corresponding to each of the speaker IDs to determine whether there is a match within a confidence threshold. In one embodiment, the confidence threshold is predetermined based on speech features previously captured from the voice query. The confidence threshold is used as a standard to analyze the voice quality of the voice captured in real time. In one embodiment, the method determines a match and bypasses a trait prediction engine. The method retrieves one or more traits from the speaker profile corresponding to the matched speaker ID and provides a content item based on the retrieved trait(s). In one embodiment, the content item is provided based on the trait(s) corresponding to the matched speaker ID. In one example, the voice query is from a father of the household, and the content item is an adult-rated content item. In another example, the voice query is from a child, and the content item is a child-rated content item. In one embodiment, the method does not determine a match within the confidence threshold. In one example, the voice query is from a guest visiting the household. The method invokes the trait prediction engine to predict a trait for the voice query that did not match within the confidence threshold while dynamically generating a new speaker ID for the voice query. The method creates a new entry in the speaker profile with the newly generated speaker ID and the corresponding predicted trait.
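By way of illustration only, this match-or-bypass decision can be sketched in Python as follows. Every name here (generate_fingerprint, confidence, the predict_traits callable standing in for the trait prediction engine) and the toy similarity score are assumptions for exposition, not the disclosed implementation:

```python
from dataclasses import dataclass, field
from math import dist

def generate_fingerprint(features):
    """Placeholder: a real system derives a biometric hash from the audio."""
    return tuple(features)  # e.g., (pitch, tone, pace)

def confidence(fp_a, fp_b):
    """Toy similarity in (0, 1]; a real system scores fingerprints statistically."""
    return 1.0 / (1.0 + dist(fp_a, fp_b))

@dataclass
class SpeakerProfile:
    entries: dict = field(default_factory=dict)  # speaker ID -> (hash, traits)

def handle_voice_query(features, profile, predict_traits, first_threshold=0.70):
    query_fp = generate_fingerprint(features)
    best_id, best_score = None, 0.0
    for speaker_id, (stored_fp, _traits) in profile.entries.items():
        score = confidence(query_fp, stored_fp)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_id is not None and best_score >= first_threshold:
        # Match within the first confidence threshold: bypass trait prediction.
        return profile.entries[best_id][1]
    # No match: invoke the trait prediction engine and add a new profile entry.
    new_traits = predict_traits(features)
    new_id = f"speaker-{len(profile.entries) + 1}"  # dynamically generated ID
    profile.entries[new_id] = (query_fp, new_traits)
    return new_traits
```

The sketch preserves the key property of the disclosure: the trait prediction engine is invoked only on the no-match branch.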

As referred to herein, the "content item" should be understood to mean electronically consumable assets, such as online games, virtual, augmented or mixed reality content, direct-to-consumer live streams (such as those provided by Twitch, for example), VR chat applications, VR video players, 360 video content, television programming, as well as pay-per-view programs, on-demand programs (as in video-on-demand (VOD) systems), Internet content (e.g., streaming content, downloadable content, Webcasts, etc.), video clips, audio, content information, pictures, rotating images, documents, playlists, websites, articles, books, electronic books, blogs, chat sessions, social media, applications, games, and/or any other media or multimedia and/or combination of the same. As referred to herein, the term "multimedia" should be understood to mean content that utilizes at least two different content forms described above, for example, text, audio, images, video, or interactivity content forms. Content may be recorded, played, displayed or accessed by user equipment devices, but can also be part of a live performance.

In various embodiments described herein, a "trait determination application" is an application that leverages acoustic features of a voice query to determine a trait of the speaker and provide a content item based on the trait to present to the speaker. In some embodiments, the trait determination application may be provided as an on-line application (i.e., provided on a website), or as a stand-alone application on a server, user device, etc. Various devices and platforms that may implement the trait determination application are described in more detail below. In some embodiments, the trait determination application, and/or any instructions for performing any of the embodiments discussed herein, may be encoded on computer-readable media. Computer-readable media includes any media capable of storing instructions and/or data. The computer-readable media may be transitory, including, but not limited to, propagating electrical or electromagnetic signals, or may be non-transitory, including, but not limited to, volatile and nonvolatile computer memory or storage devices such as a hard disk, floppy disk, USB drive, DVD, CD, card, register memory, processor caches, Random Access Memory ("RAM"), etc.

FIG. 1 shows an illustrative example of a flow of operations of a trait determination application performed by, e.g., control circuitry 406 (FIG. 4) for determining a trait of a speaker, in accordance with some embodiments of the present disclosure. In particular, FIG. 1 shows a scenario 100 where a voice query 104 (e.g., query "Play a Funny Movie") is received via user input/output device 105 (e.g., digital voice assistant). In some embodiments, the query is received as voice input from a speaker 102.

At block 106, a hash (hash value) is generated based on a voice query. In one embodiment, the hash is a biometric fingerprint generated (e.g., by audio processing circuitry 202 of FIG. 2) from the voice query. At block 108, the trait determination application compares the generated hash to hashes in a speaker profile. As shown, in one example, a speaker profile 150 is a data structure, such as a table, including a Speaker ID 152, a hash 154 corresponding to the Speaker ID 152, and traits 156 (e.g., age range, gender, ethnicity) corresponding to the Speaker ID 152. In one example, the speaker profile 150 is a profile of members living in a household, such as a father, mother, boy and girl. Based on the comparison, at block 110, the trait determination application determines whether the generated hash matches at least one hash among the hashes in the speaker profile 150 within a first confidence threshold. In one embodiment, the first confidence threshold is predetermined based on the voice previously captured from the voice query. The first confidence threshold is used as a standard to analyze the voice quality of the voice captured in real time. In one example, the first confidence threshold is in the range of 70 percent to 80 percent. In one example, the generated hash matches Hash 1 within the range of 70 percent to 80 percent. In one embodiment, at block 112, the trait determination application retrieves trait(s) from the speaker profile based on the hash matched at block 110. In one example, the matched hash is Hash 1. Thus, it is determined that the speaker 102 is an English male in the age range of 40-45 years. At block 114, the trait determination application displays "Funny Movie" based on the retrieved trait(s). At block 116, the trait determination application determines whether the matched hash matches with a score above the first confidence threshold. In one example, it is determined that the generated hash that matched Hash 1 was matched with a 90 percent score, which is higher than the range of 70 percent to 80 percent of the first confidence threshold. Thus, the biometric fingerprint of the voice captured from the voice query is determined to be of a higher quality. In one example, it is determined from the biometric fingerprint of the higher quality that the ethnicity is Hispanic. At block 118, the trait determination application updates the trait(s) and/or matched hash based on the voice query. In one example, Hash 1 is replaced with the generated hash in the speaker profile 150. In one example, the ethnicity of English for Speaker ID 1 is replaced with Hispanic in the speaker profile 150. In one embodiment, the trait determination application determines a trait to be updated based on a trait threshold. In one embodiment, the trait threshold is predetermined based on the trait previously determined from the voice query. The trait threshold is used as a standard to analyze the accuracy of the trait determined from the voice query captured in real time. In one example, the trait is the age range and the trait threshold is an age range of 40-50 years. For example, it is determined from the biometric fingerprint of the higher quality that the age range is 51-55 years, which does not fall within the trait threshold. In one example, the age range of 40-45 years for Speaker ID 1 is replaced with 51-55 years in the speaker profile 150. In another example, both the ethnicity of English for Speaker ID 1 and the age range of 40-45 years for Speaker ID 1 are replaced, with Hispanic and 51-55 years respectively, in the speaker profile 150.
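The speaker profile 150 of FIG. 1 can be pictured as the following in-memory table; this is a minimal sketch in which the first row follows the father example from the text, the guest row mirrors the "Grandma" example of blocks 122-124, and the dictionary layout itself is an assumption:

```python
# Illustrative in-memory layout of speaker profile 150 (FIG. 1).
speaker_profile = {
    "Speaker ID 1": {
        "hash": "Hash 1",
        "traits": {"age_range": "40-45", "gender": "male", "ethnicity": "English"},
    },
}

# Unknown speaker (blocks 122-124): a new entry is created and added.
speaker_profile["Unknown Speaker"] = {
    "hash": "Hash 5",
    "traits": {"age_range": "75-90", "gender": "female", "ethnicity": "Spanish"},
}
```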

Returning to block 110, when it is determined that the generated hash does not match any hash among the hashes in the speaker profile 150 within the first confidence threshold, the trait determination application analyzes the voice query using a trait prediction engine (e.g., 208 in FIG. 2) to predict the trait. In one example, it is determined that the generated hash does not match any of the hashes 154 in the speaker profile 150 within the range of 70 to 80 percent. In one example, the speaker 102 is an 80-year-old female guest (e.g., "Grandma") visiting the household. As such, the speaker 102 is an unknown speaker. At block 122, the trait determination application creates a new entry in the speaker profile. In one example, the new entry includes "unknown speaker" as the speaker ID, Hash 5, and predicted traits such as an age range of 75-90 years, female gender, and Spanish ethnicity. At block 124, the trait determination application adds this new entry to the speaker profile 150.

FIG. 2 illustrates an exemplary system 200 for determining traits based on voice analysis. In some embodiments, the system includes audio processing circuitry 202, control circuitry 204, an identification (ID) prediction engine 206, a trait prediction engine 208 and a database 210. The audio processing circuitry 202 performs a voice processing application by utilizing acoustic features extracted from audio of the voice query (e.g., 104) to identify the speaker.

In some embodiments, the audio processing circuitry 202 performs a voice processing application, such as automatic speech recognition (ASR), by utilizing acoustic features extracted from audio of the voice query. In some embodiments, the voice processing application compares acoustic features of raw audio from the voice query with previously determined acoustic features to determine whether there is a match. In some embodiments, the voice processing application may generate a biometric fingerprint uniquely identifying the voice of the speaker. For example, the voice processing application may identify unique characteristics or features of the voice of the speaker (e.g., tone, pitch, pace, etc.) and may store in a data structure a value for each of those features that is unique to the speaker. For example, the voice processing application may determine a unique pitch value, tone value and pace associated with the speech of the speaker and may store those values in a profile of the speaker. In one embodiment, the voice processing application may generate a biometric fingerprint for the voice input (e.g., by analyzing the features of the voice input). In one embodiment, the voice processing application may generate a biometric fingerprint for each speaker of a plurality of speakers (e.g., a plurality of speakers having access to input/output device 105). In one embodiment, the biometric fingerprint is a hash value (e.g., hash 154 of FIG. 1) stored in the speaker profile 150.
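As a hedged illustration of how such feature values might be reduced to a single comparable hash value, the snippet below quantizes hypothetical pitch, tone, and pace measurements before hashing. The feature names, bucket sizes, and choice of SHA-256 are assumptions; the disclosure does not specify a hashing scheme, and a production matcher would score fingerprints statistically rather than require exact hash equality:

```python
import hashlib

def biometric_hash(pitch_hz: float, tone: float, pace_wpm: float) -> str:
    # Quantize so that small measurement noise maps to the same bucket.
    buckets = (round(pitch_hz / 5), round(tone, 1), round(pace_wpm / 10))
    return hashlib.sha256(repr(buckets).encode()).hexdigest()[:16]

print(biometric_hash(118.0, 0.42, 150.0))  # stable for nearby measurements
```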

In one embodiment, the control circuitry 204 performs the trait determination application by comparing features in the generated biometric fingerprint of the voice input to the features of each biometric fingerprint of a plurality of biometric fingerprints to find a match. In some embodiments, the trait determination application may compare the generated biometric fingerprint to a plurality of biometric fingerprints stored in the database 210 (e.g., speaker profile 150), wherein each biometric fingerprint in the database is associated with a unique speaker identification (ID) (e.g., Speaker ID 152 of FIG. 1) among a plurality of speaker IDs stored in the database 210. In one embodiment, the trait determination application determines that the generated biometric fingerprint matches at least one fingerprint stored in the database 210 within a first confidence threshold and thus bypasses the ID prediction engine 206 and the trait prediction engine 208 and retrieves the trait of the speaker from the database 210 corresponding to the matched fingerprint.

In one embodiment, the trait determination application determines that the generated biometric fingerprint does not match the stored fingerprints within the first confidence threshold and thus triggers the ID prediction engine 206 and the trait prediction engine 208. In one embodiment, the ID prediction engine 206 predicts a new speaker ID based on the generated biometric fingerprint that does not match the stored fingerprints within the first confidence threshold. In one embodiment, the ID prediction engine 206 is trained to predict the new speaker ID based on the biometric fingerprints generated from the voice query.

In one embodiment, the ID prediction engine 206 is trained using audio signals from the voice query. In one embodiment, a plurality of characteristics or features of audio signals from a voice (e.g., tone, pitch, pace, etc.) are provided to a gated recurrent unit (GRU). In one embodiment, the GRU extracts significant features at an utterance level. In one embodiment, the GRU is a simplified variant of long short-term memory (LSTM) in which a forget gate selectively chooses significant features among the plurality of features. In one embodiment, the GRU utilizes principal component analysis (PCA), a signal or vector technique, to extract the most variant features from a specific signal (or vector) and, likewise, across all the signals within a class set. In one embodiment, the extracted most-variant features are fed into a convolutional neural network (CNN). In one embodiment, a tag is used as a speaker ID among the plurality of speakers (e.g., members) within an area (e.g., a household). In one embodiment, the tags are encoded using basic encoding before being fed into an ID prediction model. In one embodiment, the ID prediction model is trained on a fair training set for each speaker, and the system may use suitable loss functions, such as a subspace loss or a triplet loss function with, for example, a residual CNN. Additional details of utilizing the CNN for training data for speaker recognition are provided in https://www.groundai.com/project/few-shot-speaker-recognition-using-deep-neural-networks/1, which is incorporated by reference herein in its entirety.
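A minimal PyTorch sketch of a GRU-plus-CNN speaker-embedding model of the kind described above follows. The layer sizes, the use of adaptive average pooling in place of the PCA step, and the triplet-loss training snippet are assumptions for illustration, not the disclosed architecture:

```python
import torch
import torch.nn as nn

class SpeakerIDNet(nn.Module):
    def __init__(self, n_features=40, hidden=128, embed_dim=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv1d(hidden, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # stands in for the PCA reduction step
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):               # x: (batch, frames, n_features)
        seq, _ = self.gru(x)            # utterance-level features per frame
        seq = seq.transpose(1, 2)       # (batch, hidden, frames) for Conv1d
        z = self.cnn(seq).squeeze(-1)   # (batch, 128)
        return self.proj(z)             # speaker embedding

model = SpeakerIDNet()
loss_fn = nn.TripletMarginLoss()        # one of the loss choices named above
anchor = model(torch.randn(8, 50, 40))  # same speaker as `positive`
positive = model(torch.randn(8, 50, 40))
negative = model(torch.randn(8, 50, 40))
loss = loss_fn(anchor, positive, negative)
loss.backward()
```

Embeddings trained this way can then be compared against stored fingerprints to produce the confidence scores discussed throughout.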

In one embodiment, an output of the ID prediction model is an ID of the speaker, which is predicted with a confidence level within a confidence threshold. Additional details of confidence levels in speaker identification are provided in US Patent Publication No. 2017/0301353A, which is incorporated by reference herein in its entirety, and in https://docs.microsoft.com/en-in/azure/cognitive-services/speaker-recognition/home#identification, which is also incorporated by reference herein in its entirety. In one embodiment, the speaker ID of the speaker within the area is used as a perceptual hash function in order to pinpoint the speaker (e.g., a member) within the area (e.g., a household) and further predict traits of the speaker by using the predicted speaker ID.

In one embodiment, the trait prediction engine 208 predicts a new trait for the generated fingerprint that does not match the stored fingerprints within the first confidence threshold. In one embodiment, the trait prediction engine 208 is trained to predict the new trait based on the biometric fingerprint generated from the voice query and the new speaker ID. In one embodiment, the trait prediction engine 208 is trained using the audio signals similarly as discussed above with respect to training the ID prediction engine 206. In one embodiment, the traits are predicted using the speaker ID determined by the ID prediction engine 206. In one embodiment, a tag is used as a trait among a plurality of traits. In one embodiment, the tags are encoded using basic encoding before being fed into a trait prediction model. In one embodiment, the trait prediction model is trained on a fair training set for each speaker, and the system may use suitable loss functions, such as a subspace loss or a triplet loss function in the case of a residual CNN. An output of the trait prediction model is a trait of the speaker, which is predicted with a confidence level within a trait threshold. Additional details of training data to identify a speaker as male or female are provided in U.S. Pat. No. 6,424,946 B1, which is incorporated by reference herein in its entirety. In one embodiment, the trait determination application stores the new speaker ID, the generated fingerprint and the newly predicted trait corresponding to the new speaker ID in the database 210.
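Continuing the sketch above, a trait prediction model can reuse the same speaker embedding and attach one classification head per trait, with the trait tags integer-encoded ("basic encoding") as the disclosure suggests. The tag sets, head sizes, and cross-entropy training step are assumptions:

```python
import torch
import torch.nn as nn

# Example trait tags; the actual tag sets are not specified in the disclosure.
AGE_RANGES = ["0-12", "13-17", "18-39", "40-50", "51-75", "76+"]
GENDERS = ["female", "male"]

class TraitHead(nn.Module):
    """Per-trait classification heads over a speaker embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.age = nn.Linear(embed_dim, len(AGE_RANGES))
        self.gender = nn.Linear(embed_dim, len(GENDERS))

    def forward(self, embedding):
        return {"age_range": self.age(embedding), "gender": self.gender(embedding)}

head = TraitHead()
logits = head(torch.randn(8, 64))              # embeddings from the ID model
target = torch.randint(len(AGE_RANGES), (8,))  # integer-encoded ("basic") tags
loss = nn.functional.cross_entropy(logits["age_range"], target)
```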

In one embodiment, the trait determination application determines that the generated fingerprint that matched a stored fingerprint has a score higher than the first confidence threshold. In one embodiment, the trait determination application updates the matched biometric fingerprint with the biometric fingerprint generated from the voice query in the database 210. For example, the trait determination application replaces the matched biometric fingerprint with the generated biometric fingerprint in the database 210. In one embodiment, the trait determination application triggers the trait prediction engine 208 to update the trait based on the biometric fingerprint generated from the voice query. In one embodiment, the trait determination application determines a trait to be updated based on the trait threshold. As discussed above, in one embodiment, the trait threshold is predetermined based on the trait previously determined from the voice query. The trait threshold is used as a standard to analyze the accuracy of the trait determined from the voice query captured in real time. In one embodiment, the trait determination application determines that the trait predicted from the generated biometric fingerprint does not fall within the trait threshold. In one embodiment, the trait prediction engine 208 is trained to predict the updated trait based on the biometric fingerprint generated from the voice query.
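A hedged sketch of this update path, reusing the dictionary layout from the earlier speaker_profile example, is shown below. Representing the trait threshold as a per-trait predicate is an assumption; the disclosure states only that a trait falling outside the trait threshold is updated:

```python
def maybe_update(entry, new_hash, score, predicted_traits,
                 first_threshold=0.80, trait_thresholds=None):
    if score <= first_threshold:
        return entry                  # nothing to refresh
    entry["hash"] = new_hash          # higher-quality capture replaces the hash
    for name, new_value in predicted_traits.items():
        within_threshold = (trait_thresholds or {}).get(name, lambda v: False)
        if not within_threshold(new_value):
            entry["traits"][name] = new_value  # e.g., 40-45 -> 51-55 years
    return entry

entry = {"hash": "Hash 1",
         "traits": {"age_range": "40-45", "ethnicity": "English"}}
maybe_update(entry, "Hash 1b", 0.90,
             {"age_range": "51-55", "ethnicity": "Hispanic"},
             trait_thresholds={"age_range": lambda v: v in ("40-45", "45-50")})
# entry now holds the refreshed hash, age range 51-55, and ethnicity Hispanic.
```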

Although only one trait prediction engine 208 is shown to predict and/or update trait(s), it will be appreciated by one of ordinary skill in the art that a plurality of trait prediction engines may be used to separately predict and/or update traits such as age, gender, ethnicity, race, behavior, emotions, etc. In one embodiment, the trait prediction engine, among the plurality of trait prediction engines, for which the predicted trait is less than the trait threshold is triggered to update the predicted trait.

FIGS. 3-4 describe exemplary devices, systems, servers, and related hardware for determining trait(s) based on voice analysis to provide a content item to the speaker. FIG. 3 shows a generalized embodiment of an illustrative server 302 connected with an illustrative remote user equipment device 318. More specific implementations of the devices are discussed below in connection with FIG. 4.

System 300 is depicted having server 302 connected with remote user equipment 318 (e.g., a user's digital voice assistant or a user's smartphone) via communications network 314. For convenience, because the system 300 is described from the perspective of the server 302, the remote user equipment 318 is described as being remote (i.e., with respect to the server 302). The remote user equipment 318 may be connected to the communications network 314 via a wired or wireless connection and may receive content and data via input/output (hereinafter "I/O") path 320. The server 302 may be connected to the communications network 314 via a wired or wireless connection and may receive content and data via I/O path 304. The I/O path 304 and/or the I/O path 320 may provide content (e.g., broadcast programming, on-demand programming, Internet content, and other video, audio, or information) and data to remote control circuitry 330 and/or control circuitry 306, which include remote processing circuitry 334 and remote storage 332, and processing circuitry 310 and storage 308, respectively. The remote control circuitry 330 may be used to send and receive commands, requests, and other suitable data using the I/O path 320. The I/O path 320 may connect the remote control circuitry 330 (and specifically the remote processing circuitry 334) to one or more communications paths (described below). Likewise, the control circuitry 306 may be used to send and receive commands, requests, and other suitable data using the I/O path 304. I/O functions may be provided by one or more of these communications paths but are shown as a single path in FIG. 3 to avoid overcomplicating the drawing.

The remote control circuitry 330 and the control circuitry 306 may be based on any suitable processing circuitry, such as processing circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, etc. In some embodiments, the control circuitry 306 executes instructions for a voice processing application, a natural language processing application, and a trait determination application stored in memory (i.e., the storage 308). In client-server-based embodiments, the control circuitry 306 may include communications circuitry suitable for communicating with remote user equipment (e.g., the remote user equipment 318) or other networks or servers. For example, the trait determination application may include a first application on the server 302 and may communicate via the I/O path 304 over the communications network 314 to the remote user equipment 318 associated with a second application of the trait determination application. Additionally, the voice processing and natural language processing applications may be stored in the remote storage 332. In some embodiments, the remote control circuitry 330 may execute the voice processing application to bypass a trait prediction engine and retrieve a trait associated with a voice query from a speaker to provide content based on the retrieved trait to the speaker. In other embodiments, the remote control circuitry 330 may execute the trait determination application to bypass a trait prediction engine and retrieve a trait associated with a voice query from a speaker to provide content based on the retrieved trait to the server 302. The trait determination application (or any of the other applications) may coordinate communication over communications circuitry between the first application on the server and the second application on the remote user equipment. Communications circuitry may include a modem or other circuitry for connecting to a wired or wireless local or remote communications network. Such communications may involve the Internet or any other suitable communications networks or paths (which are described in more detail in connection with FIG. 4). In addition, communications circuitry may include circuitry that enables peer-to-peer communication of user equipment devices (e.g., WIFI-direct, Bluetooth, etc.), or communication of user equipment devices in locations remote from each other.

Memory (e.g., random-access memory, read-only memory, or any other suitable memory), hard drives, optical drives, or any other suitable fixed or removable storage devices may be provided as the remote storage 332 and/or the storage 308. The remote storage 332 and/or the storage 308 may include one or more of the above types of storage devices. The remote storage 332 and/or the storage 308 may be used to store the various types of content described herein as well as voice processing application data, natural language processing data, and trait determination application data, including content such as the speaker profile (speaker ID, hash, and traits such as age, gender, ethnicity, etc.) or other data used in operating the voice processing application, the natural language processing application and the trait determination application. Nonvolatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Although the applications are described as being stored in the storage 308 and/or the remote storage 332, the applications may include additional hardware or software that may not be included in the storages 308 and 332.

A speaker may control the remote control circuitry 330 using user input interface 322. The user input interface 322 may be any suitable user interface, such as a remote control, mouse, trackball, keypad, keyboard, touch screen, touch pad, stylus input, joystick, microphone, voice recognition interface, or other user input interfaces. Display 324 may be provided as a stand-alone device or integrated with other elements of the remote user equipment 318. The display 324 may be one or more of a monitor, a television, a liquid crystal display (LCD) for a mobile device, or any other suitable equipment for displaying visual images. Speakers may be provided as integrated with other elements of the remote user equipment 318 or may be stand-alone units.

The voice processing application, natural language processing application, and trait determination application may be implemented using any suitable architecture. For example, each may be a stand-alone application wholly implemented on the server 302. In other embodiments, some of the applications may be client-server-based applications. For example, the voice processing application may be a client-server-based application. Data for use by a thick or thin client implemented on remote user equipment 318 may be retrieved on demand by issuing requests to a server (e.g., the server 302) remote to the user equipment. In other embodiments, the server may be omitted, and the application may be implemented on the remote user equipment.

In some embodiments, as described above, the voice processing application, natural language processing application, and trait determination application may be implemented on the server 302. In this example, the remote user equipment 318 simply provides captured audio of a voice query to the server 302. However, this is only an example, and in other embodiments the applications may be implemented on a plurality of devices (e.g., the remote user equipment 318 and the server 302) to execute the features and functionalities of the applications. The applications may be configured such that features that require processing capabilities beyond those of the remote user equipment 318 are performed on the server 302, while other capabilities of the applications are performed on the remote user equipment 318.

Though exemplary system 300 is depicted having two devices implementing the voice processing application, natural language processing application, and a personalized content application, any number of devices may be used.

System 300 of FIG. 3 can be implemented in system 400 of FIG. 4 as user television equipment 402, user computer equipment 404, wireless user communications device 406, voice assistant device 424, or any other type of user equipment suitable for interfacing with the voice processing application, natural language processing application and personalized content application. For simplicity, these devices may be referred to herein collectively as user equipment or user equipment devices. User equipment devices, on which an application is at least partially implemented, may function as standalone devices or may be part of a network of devices (e.g., each device may comprise an individual module of the personalized content application). Various network configurations of devices may be implemented and are discussed in more detail below.

User television equipment 402 may include a set-top box, an integrated receiver decoder (IRD) for handling satellite television, a television set, a digital storage device, a DVD recorder, a local server, or other user television equipment. One or more of these devices may be integrated to be a single device, if desired. User computer equipment 404 may include a PC, a laptop, a tablet, a personal computer television (PC/TV), a PC server, a PC center, or other user computer equipment. Wireless user communications device 406 may include a mobile telephone, a portable video player, a portable music player, a portable gaming machine, a wireless remote control, or other wireless devices. Voice assistant device 424 may include a smart speaker, a standalone voice assistant, a smart home hub, etc.

It should be noted that the lines have become blurred when trying to classify a device as one of the above devices. In fact, each of user television equipment 402, user computer equipment 404, wireless user communications device 406, voice assistant device 424, and IOT device 428 may utilize at least some of the system features described above in connection with FIG. 3 and, as a result, include some or all of the features of the voice processing application, natural language processing application and trait determination application described herein. For example, user television equipment 402 may implement a voice processing application that is activated upon detecting a voice input comprising a keyword. The voice processing application may also have the same layout on the various different types of user equipment or may be tailored to the display capabilities of the user equipment. For example, on user computer equipment 404, the voice processing application may be provided in a visual layout where the voice processing application may recite audio prompts of the voice processing application. In another example, the voice processing application may be scaled down for wireless user communications devices. In another example, the voice processing application may not provide a GUI and may listen to and dictate audio to a user, such as on voice assistant device 424, which in some instances may not comprise a display.

In system 400, there is typically more than one of each type of user equipment device, but only one of each is shown in FIG. 4 to avoid overcomplicating the drawing. In addition, each speaker may utilize more than one type of user equipment device (e.g., a speaker may have a television set and a computer) and also more than one of each type of user equipment device (e.g., a speaker may have a digital voice assistant device and a mobile telephone and/or multiple IOT devices).

The user equipment devices may be coupled to communications network 414. Namely, user television equipment 402, user computer equipment 404, and wireless user communications device 406 are coupled to communications network 414 via communications paths 408, 410, and 412, respectively. Communications network 414 may be one or more networks including the Internet, a mobile phone network, a mobile device (e.g., iPhone) network, a cable network, a public switched telephone network, or other types of communications network or combinations of communications networks. Paths 408, 410, and 412 may separately or together include one or more communications paths, such as a satellite path, a fiber-optic path, a cable path, a path that supports Internet communications (e.g., IPTV), free-space connections (e.g., for broadcast or other wireless signals), or any other suitable wired or wireless communications path or combination of such paths. Path 412 is drawn with dotted lines to indicate that in the exemplary embodiment shown in FIG. 4 it is a wireless path, and paths 408 and 410 are drawn as solid lines to indicate they are wired paths (although these paths may be wireless paths, if desired). Communications with the user equipment devices may be provided by one or more of these communications paths but are shown as single paths in FIG. 4 to avoid overcomplicating the drawing.

Although communications paths are not drawn between user equipment devices, these devices may communicate directly with each other via communication paths, such as those described above in connection with paths 408, 410, and 412, as well as other short-range point-to-point communication paths, wireless paths (e.g., Bluetooth, infrared, IEEE 802-11x, etc.), or other short-range communication via wired or wireless paths. BLUETOOTH is a certification mark owned by Bluetooth SIG, INC. The user equipment devices may also communicate with each other through an indirect path via communications network 414.

System 400 includes a speaker content database 416 (e.g., the table structure of speaker profile 150 of FIG. 1), a content data source 418 (e.g., the "Funny Movie" of FIG. 1), and a trait determination processing server 426 coupled to communications network 414 via communication paths 420, 422, and 428, respectively. Paths 420, 422, and 428 may include any of the communication paths described above in connection with paths 408, 410, and 412. Communications with the speaker content database (database) 416 and the content data source (source) 418 may be exchanged over one or more communications paths but are shown as single paths in FIG. 4 to avoid overcomplicating the drawing. In addition, there may be more than one of each of database 416 and source 418, but only one of each is shown in FIG. 4 to avoid overcomplicating the drawing. If desired, database 416 and source 418 may be integrated as one device. Although communications between the database 416 and the source 418 with user equipment devices 402, 404, 406, 424, and 428 are shown as through communications network 414, in some embodiments the database 416 and the source 418 may communicate directly with user equipment devices 402, 404, 406, 424, and 428 via communication paths (not shown) such as those described above in connection with paths 408, 410, and 412.

Database 416 may store or index a plurality of speaker profile data (e.g., speaker ID, hash, and traits such as age, gender, ethnicity, etc.) of the speaker used for bypassing the trait prediction engine, retrieving the trait based on the voice query, and providing the content based on the retrieved trait. In some embodiments, database 416 may index the location of the speaker profile data located on servers located remotely or locally to database 416. In some embodiments, in response to identification of the speaker, the trait determination application may access the index stored on database 416 and may identify a server (e.g., a database stored on a server) comprising the trait of the identified speaker. For example, the trait determination application may receive a voice query requesting a content item and determine that the voice query matches a speaker ID stored in the speaker profile in the database 416 within a confidence threshold. The trait determination application bypasses the trait prediction engine, retrieves the trait corresponding to the matched speaker ID from the database 416 and provides a first content item from the content data source 418. In another example, the trait determination application may receive a voice query requesting a content item and determine that the voice query does not match any speaker ID stored in the speaker profile in the database 416 within the confidence threshold. The trait determination application may create a new entry in the speaker profile in the database 416 for the unmatched voice query and invoke the trait prediction engine to determine the trait of the unmatched voice query. In a further example, the trait determination application may receive a voice query requesting a content item and determine that the voice query that matched the speaker ID stored in the speaker profile in the database 416 within the confidence threshold has a score greater than the confidence threshold. The trait determination application then updates the trait stored in the database 416 corresponding to the matched speaker ID.
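A hypothetical sketch of this indexing arrangement follows: the index on database 416 maps a speaker ID to the server that holds that speaker's trait data rather than storing the traits itself. The URL, the fetch() stub, and the return shape are all assumptions:

```python
INDEX = {"Speaker ID 1": "https://profiles.example.com/household-42"}

def fetch(server_url, speaker_id):
    """Stub standing in for the RPC/HTTP call to the identified server."""
    return {"server": server_url, "speaker_id": speaker_id, "traits": {}}

def locate_traits(speaker_id, index=INDEX):
    server_url = index.get(speaker_id)
    if server_url is None:
        return None  # unknown speaker: caller invokes the trait prediction engine
    return fetch(server_url, speaker_id)
```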

Source 418 may provide data used during the operation or function of the personalized content application. For example, the source 418 may store content items and functions associated with the personalized content application, etc. In some embodiments, updates for the trait determination application may be downloaded via the source 418.

The trait determination application may be, for example, a stand-alone application implemented on user equipment devices. In other embodiments, the trait determination application may be a client-server application where only the client resides on the user equipment device. For example, the trait determination application may be implemented partially as a client application on the control circuitry of devices 402, 404, 406, 424, and/or 428 and partially on a remote server as a server application (e.g., source 418, database 416, or server 426). The trait determination application displays and/or voice control application displays may be generated by the source 418, database 416, or trait determination processing server 426 and transmitted to the user equipment devices. The source 418, database 416, and trait determination processing server 426 may also transmit data for storage on the user equipment, which then generates the voice control application displays and audio based on instructions processed by control circuitry.

System 400 is intended to illustrate a number of approaches, or configurations, by which user equipment devices and sources and servers may communicate with each other. The present invention may be applied in any one or a subset of these approaches, or in a system employing other approaches for delivering and providing a voice control application.

FIG. 5 is a flowchart of an illustrative process 500 for determining a trait based on voice analysis to present content to the speaker, in accordance with some embodiments of the disclosure. In some embodiments, each step of the process 500 can be performed by server 302 (e.g., via control circuitry 306) or by remote user equipment device 318 (e.g., via control circuitry 330) in FIG. 3.

Process 500 begins at block 502, where the control circuitry receives a voice query. In one example, the voice query is the voice query 104 (e.g., "Play a Funny Movie") as illustrated in FIG. 1. At block 504, the control circuitry determines whether the voice query matches, within a first confidence threshold, a speaker identification (ID) among a plurality of speaker IDs stored in a speaker profile. In one embodiment, a biometric fingerprint or hash is generated from the voice query (e.g., 104 in FIG. 1) and compared with the hashes (e.g., 154 in FIG. 1) in a speaker profile (e.g., 150 in FIG. 1). In one example, the speaker profile contains profile data (speaker ID, corresponding hash and trait(s)) of all members of the household. In one embodiment, the first confidence threshold is predetermined based on the voice previously captured from the voice query. The first confidence threshold is used as a standard to analyze the voice quality of the voice captured in real time. In one example, the first confidence threshold is in the range of 70 percent to 80 percent. When, at block 504, it is determined that the voice query matches, within the first confidence threshold, a speaker identification (ID) among the plurality of speaker IDs, then at block 506 the control circuitry bypasses a trait prediction engine. In one example, the generated hash matches Hash 1 within the range of 70 percent to 80 percent. At block 508, the control circuitry retrieves a trait among a plurality of traits in the speaker profile associated with the matched speaker ID. In one example, the matched speaker ID is Speaker ID 1 and the corresponding traits are an English male in the age range of 40-45 years, as illustrated in the speaker profile 150 in FIG. 1. At block 510, the control circuitry provides a first content item based on the retrieved trait. In one example, the first content item (e.g., "Funny Movie" in FIG. 1) is displayed.

Returning to block 504, when it is determined that the voice query does not match, within the first confidence threshold, a speaker identification (ID) among the plurality of speaker IDs, then at block 512 the control circuitry triggers the trait prediction engine to determine a second trait based on the voice query. In one embodiment, it is determined that the speaker profile data is not in the speaker profile (e.g., 150 in FIG. 1). In one example, the speaker is not a member of the household. In some embodiments, a new entry is created in the speaker profile. In one embodiment, a biometric fingerprint or hash generated from the voice query is utilized by the trait prediction engine to determine a trait of the speaker. In one embodiment, the biometric fingerprint or hash generated from the voice query is utilized by the ID prediction engine to create a speaker ID for the voice query. In one example, the new entry includes the speaker ID as "unknown speaker," the hash as Hash 5, and the predicted traits, as illustrated in FIG. 1. In one embodiment, the new entry is added to the speaker profile.

FIG. 6 is a flowchart of an illustrative process 600 for updating the trait in the speaker profile based on voice analysis, in accordance with some embodiments of the disclosure. In some embodiments, each step of the process 600 can be performed by server 302 (e.g., via control circuitry 306) or by remote user equipment device 318 (e.g., via control circuitry 330) in FIG. 3.

Process 600 begins at block 602, where the control circuitry determines whether the voice query that matched the speaker ID within the first confidence threshold matches with a confidence score greater than the first confidence threshold. As discussed above, in one embodiment, the first confidence threshold is predetermined based on the voice previously captured from the voice query. The first confidence threshold is used as a standard to analyze the voice quality of the voice captured in real time. In one example, the first confidence threshold is in the range of 70 percent to 80 percent. When, at block 602, it is determined that the voice query that matched the speaker ID within the first confidence threshold matches with a confidence score greater than the first confidence threshold, then at block 604 the control circuitry updates the biometric fingerprint associated with the matched speaker ID. In one example, it is determined that the generated hash that matched Hash 1 (FIG. 1) was matched with a 90 percent score, which is higher than the range of 70 percent to 80 percent of the first confidence threshold. Thus, the biometric fingerprint of the voice captured from the voice query is determined to be of a higher quality. In one example, it is determined from the biometric fingerprint of the higher quality that the ethnicity is Hispanic. At block 606, the control circuitry updates the trait associated with the matched speaker ID. In one example, the ethnicity of English for Speaker ID 1 is replaced with Hispanic in the speaker profile (e.g., 150 in FIG. 1). In one embodiment, the control circuitry determines a trait to be updated based on a trait threshold. As discussed above, the trait threshold is predetermined based on the trait previously determined from the voice query. The trait threshold is used as a standard to analyze the accuracy of the trait determined from the voice query captured in real time. In one example, the trait is the age range and the trait threshold is an age range of 40-50 years. For example, it is determined from the biometric fingerprint of the higher quality that the age range is 51-55 years, which does not fall within the trait threshold. In one example, the age range of 40-45 years for Speaker ID 1 is replaced with 51-55 years in the speaker profile (e.g., 150 in FIG. 1). In another example, both the ethnicity of English for Speaker ID 1 and the age range of 40-45 years for Speaker ID 1 are replaced, with Hispanic and 51-55 years respectively, in the speaker profile (e.g., 150 in FIG. 1).

It is contemplated that the steps or descriptions of FIGS. 5-6 may be used with any other embodiment of this disclosure. In addition, the steps described in relation to the algorithms of FIGS. 5-6 may be done in alternative orders or in parallel to further the purposes of this disclosure. For example, conditional statements and logical evaluations may be performed in any order, in parallel, or simultaneously to reduce lag or increase the speed of the system or method. As a further example, in some embodiments, several instances of a variable may be evaluated in parallel, using multiple logical processor threads, or the algorithm may be enhanced by incorporating branch prediction. Furthermore, it should be noted that the processes of FIGS. 5-6 may be implemented on a combination of appropriately configured software and hardware, and that any of the devices or equipment discussed in relation to FIGS. 1-4 could be used to implement one or more portions of the processes.

The processes discussed above are intended to be illustrative and not limiting. One skilled in the art would appreciate that the steps of the processes discussed herein may be omitted, modified, combined and/or rearranged, and any additional steps may be performed without departing from the scope of the invention. More generally, the above disclosure is meant to be exemplary and not limiting. Only the claims that follow are meant to set bounds as to what the present invention includes. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, done in different orders, or done in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.

What is claimed is:
1. A method comprising: receiving a voice query; determining whether a voice of the voice query matches within a first confidence threshold a speaker identification (ID) among a plurality of speaker IDs stored in a speaker profile; in response to determining that the voice matches to the speaker ID within the first confidence threshold: bypassing a trait prediction engine; retrieving a trait among a plurality of traits stored in the speaker profile associated with the matched speaker ID; and providing a first content item based on the retrieved trait; and in response to determining that the voice does not match to any of the speaker IDs within the first confidence threshold: triggering the trait prediction engine to determine a second trait based on characteristics of the voice; and providing a second content item based on the second trait.
2. The method of claim 1, further comprising: updating the profile with the second trait.
3. The method of claim 1, further comprising: in response to determining that the voice query does not match to any of the speaker IDs within the first confidence threshold: creating a new speaker ID associated with the second trait, wherein the new speaker ID is generated based on the received voice query.
4. The method of claim 1 further comprising: determining that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; and updating the trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold.
5. The method of claim 4, wherein the updating the retrieved trait further comprises: training the trait prediction engine to predict the updated trait based on a biometric fingerprint generated from the voice query matched to the speaker ID with the confidence score greater than the first confidence threshold.
6. The method of claim 1 further comprising: determining that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; determining that the retrieved trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold is not within a trait threshold; and updating the retrieved trait.
7. The method of claim 1 further comprising: determining that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; and updating a biometric fingerprint associated with the matched speaker ID.
8. The method of claim 7, wherein updating the biometric fingerprint further comprises: replacing the biometric fingerprint associated with the matched speaker ID with a biometric fingerprint generated from the voice query.
9. The method of claim 1, wherein determining whether the voice query matches within a first confidence threshold further comprises: generating a biometric fingerprint from the voice query; and comparing the generated biometric fingerprint with a plurality of biometric fingerprints stored in the speaker profile, wherein each of the plurality of biometric fingerprints corresponds to a respective speaker ID among the plurality of speaker IDs.
10. A system comprising: a memory configured to store a speaker profile; and control circuitry coupled to the memory and configured to: determine whether a voice of a voice query matches within a first confidence threshold a speaker identification (ID) among a plurality of speaker IDs stored in the speaker profile; in response to determining that the voice matches to the speaker ID within the first confidence threshold: bypass a trait prediction engine; retrieve a trait among a plurality of traits stored in the speaker profile associated with the matched speaker ID; and provide a first content item based on the retrieved trait; and in response to determining that the voice does not match to any of the speaker IDs within the first confidence threshold: trigger the trait prediction engine to determine a second trait based on characteristics of the voice; and provide a second content item based on the second trait.
11. The system of claim 10, wherein the control circuitry is configured to: update the profile with the second trait.
12. The system of claim 10, wherein the control circuitry is configured to: in response to determining that the voice query does not match to any of the speaker IDs within the first confidence threshold: create a new speaker ID associated with the second trait, wherein the new speaker ID is generated based on the received voice query.
13. The system of claim 10, wherein the control circuitry is configured to: determine that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; and update the trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold.
14. The system of claim 13, wherein to update the retrieved trait the control circuitry is configured to: train the trait prediction engine to predict the updated trait based on a biometric fingerprint generated from the voice query matched to the speaker ID with the confidence score greater than the first confidence threshold.
15. The system of claim 10, wherein the control circuitry is configured to: determine that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; determine that the retrieved trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold is not within a trait threshold; and update the retrieved trait.
16. The system of claim 10, wherein the control circuitry is configured to: determine that the voice query matched to the speaker ID within the first confidence threshold matched with a confidence score that is greater than the first confidence threshold; and update a biometric fingerprint associated with the matched speaker ID.
17. The system of claim 16, wherein to update the biometric fingerprint, the control circuitry is configured to: replace the biometric fingerprint associated with the matched speaker ID with a biometric fingerprint generated from the voice query.
18. The system of claim 10, wherein to determine whether the voice query matches within a first confidence threshold, the control circuitry is configured to: generate a biometric fingerprint from the voice query; and compare the generated biometric fingerprint with a plurality of biometric fingerprints stored in the speaker profile, wherein each of the plurality of biometric fingerprints corresponds to a respective speaker ID among the plurality of speaker IDs.
19. A system comprising: a memory configured to store a speaker profile; and control circuitry coupled to the memory and configured to: determine whether a voice of a voice query matches within a first confidence threshold a speaker identification (ID) among a plurality of speaker IDs stored in the speaker profile; in response to determining that the voice matches to the speaker ID within the first confidence threshold: bypass a trait prediction engine; retrieve a trait among a plurality of traits stored in the speaker profile associated with the matched speaker ID; and provide a first content item based on the retrieved trait; determine that the voice query matched to the speaker ID within the first confidence threshold matches with a confidence score greater than the first confidence threshold; and update the trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold, wherein to update the retrieved trait the control circuitry is configured to: train the trait prediction engine to predict the updated trait based on a biometric fingerprint generated from the voice query matched to the speaker ID with the confidence score greater than the first confidence threshold.
20. A method comprising: receiving a voice query; determining whether a voice of the voice query matches within a first confidence threshold a speaker identification (ID) among a plurality of speaker IDs stored in a speaker profile; in response to determining that the voice matches the speaker ID within the first confidence threshold: bypassing a trait prediction engine; retrieving a trait among a plurality of traits stored in the speaker profile associated with the matched speaker ID; and providing a first content item based on the retrieved trait; and in response to determining that the voice query matched to the speaker ID within the first confidence threshold matches with a confidence score that is greater than the first confidence threshold: updating the trait associated with the matched speaker ID with the confidence score greater than the first confidence threshold, wherein the updating comprises: training the trait prediction engine to predict the updated trait based on a biometric fingerprint generated from the voice query matched to the speaker ID with the confidence score greater than the first confidence threshold.